-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix regression for pool creation timeout retry #887
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Contributor
tiagolobocastro
commented
Nov 20, 2024
•
edited
Loading
edited
Pull the Release key from a recent k8s version since the old keys are no longer valid. This will have to be updated from time to time. Signed-off-by: Tiago Castro <[email protected]>
A previous fix ended up not working correctly because it was merged incorrectly, somehow! Signed-off-by: Tiago Castro <[email protected]>
tiagolobocastro
requested review from
sinhaashish,
Abhinandan-Purkait,
abhilashshetty04 and
dsharma-dc
November 20, 2024 00:19
Resolves openebs/mayastor#1772 |
dsharma-dc
approved these changes
Nov 20, 2024
control-plane/agents/src/bin/core/controller/resources/operations_helper.rs
Outdated
Show resolved
Hide resolved
When the initial create gRPC times out, the data-plane may still be creating the pool in the background, which can happen for very large pools. Rather than assume failure, we allow this to complete in the background up to a large arbitrary amount of time. If the pool creation completes before, then we retry the creation flow. The reason why we don't simply use very large timeouts is because the gRPC operations are currently sequential, mostly due to historical reasons. Now that the data-plane is allowing concurrent calls, we should also allow this on the control-plane. TODO: allow concurrent node operations Signed-off-by: Tiago Castro <[email protected]>
tiagolobocastro
force-pushed
the
pool-timeout
branch
3 times, most recently
from
November 25, 2024 12:59
9607994
to
405f633
Compare
Uses LVM Lvols as backend devices for the pool. We suspend these before pool creation, allowing us to simulate slow pool creation. This test ensures that the pool creation is completed by itself and also that a client can also complete it by calling create again. Signed-off-by: Tiago Castro <[email protected]>
tiagolobocastro
force-pushed
the
pool-timeout
branch
2 times, most recently
from
November 25, 2024 15:27
b0afe9d
to
83ac4b4
Compare
7 tasks
bors try |
tryBuild failed: |
Use a tmp folder from the workspace allowing us to cleanup up things like LVM volumes a lot easier as we can just purge it. Signed-off-by: Tiago Castro <[email protected]>
Not sure why this is starting to fail now... even on an unchanged release branch it's failing now!? Signed-off-by: Tiago Castro <[email protected]>
tiagolobocastro
force-pushed
the
pool-timeout
branch
from
November 25, 2024 19:32
83ac4b4
to
8caf2a6
Compare
bors try |
tryBuild succeeded: |
Abhinandan-Purkait
approved these changes
Nov 26, 2024
bors merge |
Build succeeded: |
tiagolobocastro
added a commit
that referenced
this pull request
Nov 26, 2024
887: Fix regression for pool creation timeout retry r=tiagolobocastro a=tiagolobocastro test: use tmp in project workspace Use a tmp folder from the workspace allowing us to cleanup up things like LVM volumes a lot easier as we can just purge it. Signed-off-by: Tiago Castro <[email protected]> --- test(pool): create on very large or very slow disks Uses LVM Lvols as backend devices for the pool. We suspend these before pool creation, allowing us to simulate slow pool creation. This test ensures that the pool creation is completed by itself and also that a client can also complete it by calling create again. Signed-off-by: Tiago Castro <[email protected]> --- fix: allow pool creation to complete asynchronously When the initial create gRPC times out, the data-plane may still be creating the pool in the background, which can happen for very large pools. Rather than assume failure, we allow this to complete in the background up to a large arbitrary amount of time. If the pool creation completes before, then we retry the creation flow. The reason why we don't simply use very large timeouts is because the gRPC operations are currently sequential, mostly due to historical reasons. Now that the data-plane is allowing concurrent calls, we should also allow this on the control-plane. TODO: allow concurrent node operations Signed-off-by: Tiago Castro <[email protected]> --- fix: check for correct not found error code A previous fix ended up not working correctly because it was merged incorrectly, somehow! Signed-off-by: Tiago Castro <[email protected]> --- chore: update terraform node prep Pull the Release key from a recent k8s version since the old keys are no longer valid. This will have to be updated from time to time. Co-authored-by: Tiago Castro <[email protected]>
bors-openebs-mayastor bot
pushed a commit
that referenced
this pull request
Nov 26, 2024
890: Backport fixes to release/2.7 r=tiagolobocastro a=tiagolobocastro chore(bors): merge pull request #887 887: Fix regression for pool creation timeout retry r=tiagolobocastro a=tiagolobocastro test: use tmp in project workspace Use a tmp folder from the workspace allowing us to cleanup up things like LVM volumes a lot easier as we can just purge it. Signed-off-by: Tiago Castro <[email protected]> --- test(pool): create on very large or very slow disks Uses LVM Lvols as backend devices for the pool. We suspend these before pool creation, allowing us to simulate slow pool creation. This test ensures that the pool creation is completed by itself and also that a client can also complete it by calling create again. Signed-off-by: Tiago Castro <[email protected]> --- fix: allow pool creation to complete asynchronously When the initial create gRPC times out, the data-plane may still be creating the pool in the background, which can happen for very large pools. Rather than assume failure, we allow this to complete in the background up to a large arbitrary amount of time. If the pool creation completes before, then we retry the creation flow. The reason why we don't simply use very large timeouts is because the gRPC operations are currently sequential, mostly due to historical reasons. Now that the data-plane is allowing concurrent calls, we should also allow this on the control-plane. TODO: allow concurrent node operations Signed-off-by: Tiago Castro <[email protected]> --- fix: check for correct not found error code A previous fix ended up not working correctly because it was merged incorrectly, somehow! Signed-off-by: Tiago Castro <[email protected]> --- chore: update terraform node prep Pull the Release key from a recent k8s version since the old keys are no longer valid. This will have to be updated from time to time. Co-authored-by: Tiago Castro <[email protected]> --- fix(resize): atomically check for the required size Ensures races don't lead into volume resize failures. Signed-off-by: Tiago Castro <[email protected]> --- test(bdd/thin): fix racy thin prov test Add retry waiting for condition to be met. Signed-off-by: Tiago Castro <[email protected]> --- feat(topology): remove the internal labels while displaying Signed-off-by: sinhaashish <[email protected]> --- fix(fsfreeze): improved error message when volume is not staged Signed-off-by: Abhinandan Purkait <[email protected]> --- fix(deployer): increasing the max number of allowed connection attempts to the io-engine Signed-off-by: sinhaashish <[email protected]> --- fix(topology): hasTopologyKey overwites affinityTopologyLabels Signed-off-by: sinhaashish <[email protected]> Co-authored-by: sinhaashish <[email protected]> Co-authored-by: Abhinandan Purkait <[email protected]> Co-authored-by: Tiago Castro <[email protected]> Co-authored-by: mayastor-bors <[email protected]>
tiagolobocastro
added a commit
that referenced
this pull request
Nov 26, 2024
887: Fix regression for pool creation timeout retry r=tiagolobocastro a=tiagolobocastro test: use tmp in project workspace Use a tmp folder from the workspace allowing us to cleanup up things like LVM volumes a lot easier as we can just purge it. Signed-off-by: Tiago Castro <[email protected]> --- test(pool): create on very large or very slow disks Uses LVM Lvols as backend devices for the pool. We suspend these before pool creation, allowing us to simulate slow pool creation. This test ensures that the pool creation is completed by itself and also that a client can also complete it by calling create again. Signed-off-by: Tiago Castro <[email protected]> --- fix: allow pool creation to complete asynchronously When the initial create gRPC times out, the data-plane may still be creating the pool in the background, which can happen for very large pools. Rather than assume failure, we allow this to complete in the background up to a large arbitrary amount of time. If the pool creation completes before, then we retry the creation flow. The reason why we don't simply use very large timeouts is because the gRPC operations are currently sequential, mostly due to historical reasons. Now that the data-plane is allowing concurrent calls, we should also allow this on the control-plane. TODO: allow concurrent node operations Signed-off-by: Tiago Castro <[email protected]> --- fix: check for correct not found error code A previous fix ended up not working correctly because it was merged incorrectly, somehow! Signed-off-by: Tiago Castro <[email protected]> --- chore: update terraform node prep Pull the Release key from a recent k8s version since the old keys are no longer valid. This will have to be updated from time to time. Co-authored-by: Tiago Castro <[email protected]>
tiagolobocastro
added a commit
that referenced
this pull request
Nov 26, 2024
887: Fix regression for pool creation timeout retry r=tiagolobocastro a=tiagolobocastro test: use tmp in project workspace Use a tmp folder from the workspace allowing us to cleanup up things like LVM volumes a lot easier as we can just purge it. Signed-off-by: Tiago Castro <[email protected]> --- test(pool): create on very large or very slow disks Uses LVM Lvols as backend devices for the pool. We suspend these before pool creation, allowing us to simulate slow pool creation. This test ensures that the pool creation is completed by itself and also that a client can also complete it by calling create again. Signed-off-by: Tiago Castro <[email protected]> --- fix: allow pool creation to complete asynchronously When the initial create gRPC times out, the data-plane may still be creating the pool in the background, which can happen for very large pools. Rather than assume failure, we allow this to complete in the background up to a large arbitrary amount of time. If the pool creation completes before, then we retry the creation flow. The reason why we don't simply use very large timeouts is because the gRPC operations are currently sequential, mostly due to historical reasons. Now that the data-plane is allowing concurrent calls, we should also allow this on the control-plane. TODO: allow concurrent node operations Signed-off-by: Tiago Castro <[email protected]> --- fix: check for correct not found error code A previous fix ended up not working correctly because it was merged incorrectly, somehow! Signed-off-by: Tiago Castro <[email protected]> --- chore: update terraform node prep Pull the Release key from a recent k8s version since the old keys are no longer valid. This will have to be updated from time to time. Co-authored-by: Tiago Castro <[email protected]> Signed-off-by: Tiago Castro <[email protected]>
bors-openebs-mayastor bot
pushed a commit
that referenced
this pull request
Nov 26, 2024
890: Backport fixes to release/2.7 r=tiagolobocastro a=tiagolobocastro chore(bors): merge pull request #887 887: Fix regression for pool creation timeout retry r=tiagolobocastro a=tiagolobocastro test: use tmp in project workspace Use a tmp folder from the workspace allowing us to cleanup up things like LVM volumes a lot easier as we can just purge it. Signed-off-by: Tiago Castro <[email protected]> --- test(pool): create on very large or very slow disks Uses LVM Lvols as backend devices for the pool. We suspend these before pool creation, allowing us to simulate slow pool creation. This test ensures that the pool creation is completed by itself and also that a client can also complete it by calling create again. Signed-off-by: Tiago Castro <[email protected]> --- fix: allow pool creation to complete asynchronously When the initial create gRPC times out, the data-plane may still be creating the pool in the background, which can happen for very large pools. Rather than assume failure, we allow this to complete in the background up to a large arbitrary amount of time. If the pool creation completes before, then we retry the creation flow. The reason why we don't simply use very large timeouts is because the gRPC operations are currently sequential, mostly due to historical reasons. Now that the data-plane is allowing concurrent calls, we should also allow this on the control-plane. TODO: allow concurrent node operations Signed-off-by: Tiago Castro <[email protected]> --- fix: check for correct not found error code A previous fix ended up not working correctly because it was merged incorrectly, somehow! Signed-off-by: Tiago Castro <[email protected]> --- chore: update terraform node prep Pull the Release key from a recent k8s version since the old keys are no longer valid. This will have to be updated from time to time. Co-authored-by: Tiago Castro <[email protected]> --- fix(resize): atomically check for the required size Ensures races don't lead into volume resize failures. Signed-off-by: Tiago Castro <[email protected]> --- test(bdd/thin): fix racy thin prov test Add retry waiting for condition to be met. Signed-off-by: Tiago Castro <[email protected]> --- feat(topology): remove the internal labels while displaying Signed-off-by: sinhaashish <[email protected]> --- fix(fsfreeze): improved error message when volume is not staged Signed-off-by: Abhinandan Purkait <[email protected]> --- fix(deployer): increasing the max number of allowed connection attempts to the io-engine Signed-off-by: sinhaashish <[email protected]> --- fix(topology): hasTopologyKey overwites affinityTopologyLabels Signed-off-by: sinhaashish <[email protected]> Co-authored-by: sinhaashish <[email protected]> Co-authored-by: Abhinandan Purkait <[email protected]> Co-authored-by: Tiago Castro <[email protected]> Co-authored-by: mayastor-bors <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.